This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.
Not all users receive the same offer, and that is the challenge to solve with this data set.
Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.
Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.
You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.
Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.
To give an example, a user could receive a discount offer buy 10 dollars get 2 off on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.
However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off offer", but the user never opens the offer during the 10 day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.
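The influence rule just described can be sketched as a small check. This is a minimal sketch using a hypothetical event log for one user and one offer; the event names and `time` column mirror the transcript, but the values are made up:

```python
import pandas as pd

# Hypothetical event log (times in hours, like the transcript's `time` column).
events = pd.DataFrame({
    'event': ['offer received', 'offer viewed', 'offer completed'],
    'time':  [0, 24, 60],
})

validity_hours = 10 * 24  # a 10-day offer

received = events.loc[events.event == 'offer received', 'time'].iloc[0]
viewed = events.loc[events.event == 'offer viewed', 'time']
completed = events.loc[events.event == 'offer completed', 'time']

# A completion counts as "influenced" only if the user viewed the offer
# before completing it, and the completion fell inside the validity window.
influenced = (
    not completed.empty
    and not viewed.empty
    and viewed.iloc[0] <= completed.iloc[0] <= received + validity_hours
)
print(influenced)
```

Completions with no preceding view (or a view after the completion) would fail this check and should not be credited to the offer.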
This makes data cleaning especially important and tricky.
You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.
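One way to estimate that baseline is to look at spend among customers who never received an offer. The sketch below uses a made-up transaction table; the `person` and `amount` columns and the offer list are illustrative assumptions, not values from the real data:

```python
import pandas as pd

# Hypothetical transactions; 'a' received an offer, 'b' and 'c' did not.
tx = pd.DataFrame({
    'person': ['a', 'a', 'b', 'c'],
    'amount': [10.0, 5.0, 3.0, 8.0],
})
got_offer = {'a'}  # customers who received at least one offer

# Total spend per customer among those who never received an offer.
baseline = tx[~tx.person.isin(got_offer)].groupby('person')['amount'].sum()
print(baseline.mean())  # average no-offer spend
```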
Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (i.e., 75 percent of women customers who were 35 years old responded to offer A vs 40 percent from the same demographic to offer B, so send offer A).
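The heuristic approach above amounts to a response-rate table per demographic segment. A minimal sketch, using a hypothetical per-customer response table (the column names and values are illustrative, not taken from the real transcript):

```python
import pandas as pd

# Hypothetical responses: 1 if the customer responded to the offer shown.
responses = pd.DataFrame({
    'gender':    ['F', 'F', 'F', 'M', 'M', 'M'],
    'offer':     ['A', 'A', 'B', 'A', 'B', 'B'],
    'responded': [1, 1, 0, 0, 1, 0],
})

# Response rate per (gender, offer) segment.
rates = responses.groupby(['gender', 'offer'])['responded'].mean()
print(rates)

# Pick the best-performing offer for each gender segment.
best = rates.groupby(level='gender').idxmax().apply(lambda t: t[1])
print(best)
```

With real data you would segment on age range and income as well, and require a minimum segment size before trusting a rate.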
The data is contained in three files:
Here is the schema and explanation of each variable in the files:
portfolio.json
profile.json
transcript.json
Note: If you are using the workspace, you will need to go to the terminal and run the command conda update pandas before reading in the files. This is because the version of pandas in the workspace cannot read the transcript.json file correctly, but the newest version of pandas can. You can access the terminal from the orange icon in the top left of this notebook.
You can see how to access the terminal and how the install works using the two images below. First you need to access the terminal:

Then you will want to run the above command:

Finally, when you enter back into the notebook (use the jupyter icon again), you should be able to run the below cell without any errors.
The purpose of this project is to study consumer behavior toward Starbucks offers and use it to develop offer recommendations for consumers, by combining basic consumer data with shopping behavior, including how customers respond to the various offers they receive.
Effective analysis requires setting goals, understanding an overview of the data to support modeling, and summarizing the results so they can be extended in the future. I have divided the work as follows:
Metrics
Exploratory Data Analysis (EDA)
Modeling
In this project, my emphasis is on accuracy and error metrics, which show how well a model performs. I compare analyses using algorithms from the scikit-learn library to determine which is best suited to this data set.
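Concretely, "accuracy" here is the fraction of correct class labels and "error" is a regression error such as RMSE. A toy illustration of the two headline metrics (the labels and predictions below are made up; the real evaluation uses the models trained later in the notebook):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Toy classification labels vs. predictions: 3 of 4 correct.
y_true_cls = [1, 0, 1, 1]
y_pred_cls = [1, 0, 0, 1]
print(accuracy_score(y_true_cls, y_pred_cls))  # fraction of correct labels

# Toy regression targets vs. predictions.
y_true_reg = [3.0, 5.0, 2.0]
y_pred_reg = [2.5, 5.0, 3.0]
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
print(rmse)
```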
# Import libraries.
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import math
import json
import datetime
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.metrics import plot_confusion_matrix
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
# load in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)
Overview of portfolio: shows the details of each offer, keyed by id and offer_type.
#Check the basics of the data.
portfolio
| | reward | channels | difficulty | duration | offer_type | id |
|---|---|---|---|---|---|---|
| 0 | 10 | [email, mobile, social] | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd |
| 1 | 10 | [web, email, mobile, social] | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
| 2 | 0 | [web, email, mobile] | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed |
| 3 | 5 | [web, email, mobile] | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 4 | 5 | [web, email] | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |
| 5 | 3 | [web, email, mobile, social] | 7 | 7 | discount | 2298d6c36e964ae4a3e7e9706d1fb8c2 |
| 6 | 2 | [web, email, mobile, social] | 10 | 10 | discount | fafdcd668e3743c1bb461111dcafc2a4 |
| 7 | 0 | [email, mobile, social] | 0 | 3 | informational | 5a8bc65990b245e5a138643cd4eb9837 |
| 8 | 5 | [web, email, mobile, social] | 5 | 5 | bogo | f19421c1d4aa40978ebb69ca19b0e20d |
| 9 | 2 | [web, email, mobile] | 10 | 7 | discount | 2906b810c7d4411798c6938adc9daaa5 |
portfolio.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   reward      10 non-null     int64
 1   channels    10 non-null     object
 2   difficulty  10 non-null     int64
 3   duration    10 non-null     int64
 4   offer_type  10 non-null     object
 5   id          10 non-null     object
dtypes: int64(3), object(3)
memory usage: 608.0+ bytes
portfolio.shape
(10, 6)
portfolio.columns
Index(['reward', 'channels', 'difficulty', 'duration', 'offer_type', 'id'], dtype='object')
Overview of profile: shows customer details. There are 17,000 records, each with gender, age, id, became_member_on and income.
#Check the basics of the data.
profile.head()
| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | None | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 | NaN |
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.0 |
| 2 | None | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 | NaN |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.0 |
| 4 | None | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 | NaN |
profile.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   gender            14825 non-null  object
 1   age               17000 non-null  int64
 2   id                17000 non-null  object
 3   became_member_on  17000 non-null  int64
 4   income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB
profile.shape
(17000, 5)
profile.columns
Index(['gender', 'age', 'id', 'became_member_on', 'income'], dtype='object')
profile.age.min(), profile.age.max()
(18, 118)
profile.became_member_on.min(), profile.became_member_on.max()
(20130729, 20180726)
profile.income.min(), profile.income.max()
(30000.0, 120000.0)
profile.index
RangeIndex(start=0, stop=17000, step=1)
# Replace missing gender (None) with "O".
profile['gender'] = profile['gender'].fillna('O')
profile.head()
| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | O | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 | NaN |
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.0 |
| 2 | O | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 | NaN |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.0 |
| 4 | O | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 | NaN |
Overview of transcript: shows event records for each customer. There are 306,534 records, each with person, event, value and time.
#Check the basics of the data.
transcript.head()
| | person | event | value | time |
|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} | 0 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} | 0 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | {'offer id': '2906b810c7d4411798c6938adc9daaa5'} | 0 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} | 0 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} | 0 |
transcript.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   person  306534 non-null  object
 1   event   306534 non-null  object
 2   value   306534 non-null  object
 3   time    306534 non-null  int64
dtypes: int64(1), object(3)
memory usage: 9.4+ MB
transcript.shape
(306534, 4)
transcript.columns
Index(['person', 'event', 'value', 'time'], dtype='object')
transcript.time.min(), transcript.time.max()
(0, 714)
#Amount of portfolio, profile and transcript.
print("Amount of portfolio: ", portfolio.shape[0])
print("Amount of profile: ",profile.shape[0])
print("Amount of transcript: ",transcript.shape[0])
Amount of portfolio: 10
Amount of profile: 17000
Amount of transcript: 306534
# Copy portfolio into df_portfolio.
df_portfolio = portfolio.copy()
df_portfolio
| | reward | channels | difficulty | duration | offer_type | id |
|---|---|---|---|---|---|---|
| 0 | 10 | [email, mobile, social] | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd |
| 1 | 10 | [web, email, mobile, social] | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
| 2 | 0 | [web, email, mobile] | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed |
| 3 | 5 | [web, email, mobile] | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 4 | 5 | [web, email] | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |
| 5 | 3 | [web, email, mobile, social] | 7 | 7 | discount | 2298d6c36e964ae4a3e7e9706d1fb8c2 |
| 6 | 2 | [web, email, mobile, social] | 10 | 10 | discount | fafdcd668e3743c1bb461111dcafc2a4 |
| 7 | 0 | [email, mobile, social] | 0 | 3 | informational | 5a8bc65990b245e5a138643cd4eb9837 |
| 8 | 5 | [web, email, mobile, social] | 5 | 5 | bogo | f19421c1d4aa40978ebb69ca19b0e20d |
| 9 | 2 | [web, email, mobile] | 10 | 7 | discount | 2906b810c7d4411798c6938adc9daaa5 |
#Frequency of reward.
df_portfolio.reward.value_counts()
5     3
0     2
2     2
10    2
3     1
Name: reward, dtype: int64
#Frequency of difficulty.
df_portfolio.difficulty.value_counts()
10    4
0     2
5     2
20    1
7     1
Name: difficulty, dtype: int64
#Frequency of duration.
df_portfolio.duration.value_counts()
7     4
5     2
10    2
3     1
4     1
Name: duration, dtype: int64
#Frequency of offer_type.
df_portfolio.offer_type.value_counts()
discount         4
bogo             4
informational    2
Name: offer_type, dtype: int64
# Add one indicator column per channel.
channels_val = set()
for val in df_portfolio.channels:
    channels_val.update(set(val))
for val in channels_val:
    df_portfolio[val] = df_portfolio.channels.apply(lambda x: 1 if val in x else 0)
df_portfolio
| | reward | channels | difficulty | duration | offer_type | id | email | web | social | mobile |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | [email, mobile, social] | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| 1 | 10 | [web, email, mobile, social] | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 | 1 | 1 | 1 | 1 |
| 2 | 0 | [web, email, mobile] | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed | 1 | 1 | 0 | 1 |
| 3 | 5 | [web, email, mobile] | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 1 | 1 | 0 | 1 |
| 4 | 5 | [web, email] | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 1 | 1 | 0 | 0 |
| 5 | 3 | [web, email, mobile, social] | 7 | 7 | discount | 2298d6c36e964ae4a3e7e9706d1fb8c2 | 1 | 1 | 1 | 1 |
| 6 | 2 | [web, email, mobile, social] | 10 | 10 | discount | fafdcd668e3743c1bb461111dcafc2a4 | 1 | 1 | 1 | 1 |
| 7 | 0 | [email, mobile, social] | 0 | 3 | informational | 5a8bc65990b245e5a138643cd4eb9837 | 1 | 0 | 1 | 1 |
| 8 | 5 | [web, email, mobile, social] | 5 | 5 | bogo | f19421c1d4aa40978ebb69ca19b0e20d | 1 | 1 | 1 | 1 |
| 9 | 2 | [web, email, mobile] | 10 | 7 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 |
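The same one-hot encoding of `channels` can be done without an explicit loop: join each channel list into a delimited string, then use `Series.str.get_dummies`. A sketch on a tiny stand-in frame (the two rows are illustrative):

```python
import pandas as pd

# Two sample channel lists, mirroring the structure of portfolio.channels.
pf = pd.DataFrame({'channels': [['email', 'mobile', 'social'],
                                ['web', 'email']]})

# Join each list with '|' and expand into 0/1 indicator columns.
dummies = pf.channels.str.join('|').str.get_dummies()
print(dummies)
```

Note that `get_dummies` sorts the resulting columns alphabetically, whereas the loop above follows set iteration order.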
#Counts of channels by email, social, web, mobile.
print("Frequency of channels from email: ", df_portfolio.email.sum())
print("Frequency of channels from social: ", df_portfolio.social.sum())
print("Frequency of channels from web: ", df_portfolio.web.sum())
print("Frequency of channels from mobile: ", df_portfolio.mobile.sum())
Frequency of channels from email: 10
Frequency of channels from social: 6
Frequency of channels from web: 8
Frequency of channels from mobile: 9
# Sum of the channels used by each offer_type.
offer_type_by_chans = df_portfolio[['offer_type', 'email', 'social', 'web', 'mobile']].copy()
offer_type_by_chans['sum'] = offer_type_by_chans[['email', 'social', 'web', 'mobile']].sum(axis=1)
offer_type_by_chans
| | offer_type | email | social | web | mobile | sum |
|---|---|---|---|---|---|---|
| 0 | bogo | 1 | 1 | 0 | 1 | 3 |
| 1 | bogo | 1 | 1 | 1 | 1 | 4 |
| 2 | informational | 1 | 0 | 1 | 1 | 3 |
| 3 | bogo | 1 | 0 | 1 | 1 | 3 |
| 4 | discount | 1 | 0 | 1 | 0 | 2 |
| 5 | discount | 1 | 1 | 1 | 1 | 4 |
| 6 | discount | 1 | 1 | 1 | 1 | 4 |
| 7 | informational | 1 | 1 | 0 | 1 | 3 |
| 8 | bogo | 1 | 1 | 1 | 1 | 4 |
| 9 | discount | 1 | 0 | 1 | 1 | 3 |
# Summary statistics for each channel.
offer_type_by_chans.iloc[:, :5].describe()
| | email | social | web | mobile |
|---|---|---|---|---|
| count | 10.0 | 10.000000 | 10.000000 | 10.000000 |
| mean | 1.0 | 0.600000 | 0.800000 | 0.900000 |
| std | 0.0 | 0.516398 | 0.421637 | 0.316228 |
| min | 1.0 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.0 | 0.000000 | 1.000000 | 1.000000 |
| 50% | 1.0 | 1.000000 | 1.000000 | 1.000000 |
| 75% | 1.0 | 1.000000 | 1.000000 | 1.000000 |
| max | 1.0 | 1.000000 | 1.000000 | 1.000000 |
#Sort data.
offer_type_by_chans.sort_values(by=['offer_type'])
| | offer_type | email | social | web | mobile | sum |
|---|---|---|---|---|---|---|
| 0 | bogo | 1 | 1 | 0 | 1 | 3 |
| 1 | bogo | 1 | 1 | 1 | 1 | 4 |
| 3 | bogo | 1 | 0 | 1 | 1 | 3 |
| 8 | bogo | 1 | 1 | 1 | 1 | 4 |
| 4 | discount | 1 | 0 | 1 | 0 | 2 |
| 5 | discount | 1 | 1 | 1 | 1 | 4 |
| 6 | discount | 1 | 1 | 1 | 1 | 4 |
| 9 | discount | 1 | 0 | 1 | 1 | 3 |
| 2 | informational | 1 | 0 | 1 | 1 | 3 |
| 7 | informational | 1 | 1 | 0 | 1 | 3 |
# Total channel usage by offer type.
offer_type_by_chans.groupby(['offer_type']).sum()
| | email | social | web | mobile | sum |
|---|---|---|---|---|---|
| offer_type | | | | | |
| bogo | 4 | 3 | 3 | 4 | 14 |
| discount | 4 | 2 | 4 | 3 | 13 |
| informational | 2 | 1 | 1 | 2 | 6 |
According to the portfolio data, every offer is delivered by email, while social is the least-used channel; bogo and discount offers each appear four times and informational offers twice.
#Relation of reward, difficulty and duration.
plt.figure(figsize=(10, 5))
(df_portfolio.duration / df_portfolio.shape[0]).plot(kind='line', marker='o', label='duration')
(df_portfolio.difficulty / df_portfolio.shape[0]).plot(kind='line', marker='o', label='difficulty')
(df_portfolio.reward / df_portfolio.shape[0]).plot(kind='line', marker='o', label='reward')
plt.legend()
plt.ylabel('Score (%)', fontsize=12)
plt.xlabel('Index', fontsize=12)
plt.title('Relation of reward, difficulty and duration.');
#The most impactful communication channel.
(offer_type_by_chans.groupby(['offer_type']).sum().iloc[:, :4]).plot(kind='box', figsize=(10,5));
plt.ylabel('Frequency of channels.', fontsize = 12);
plt.xlabel('Channels.', fontsize = 12);
plt.title('Relation of channels and frequency.', fontsize = 15);
As shown above, email is the most widely used communication channel and social is the least used.
# Copy profile into df_profile.
df_profile = profile.copy()
df_profile.head()
| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | O | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 | NaN |
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.0 |
| 2 | O | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 | NaN |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.0 |
| 4 | O | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 | NaN |
df_profile.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   gender            17000 non-null  object
 1   age               17000 non-null  int64
 2   id                17000 non-null  object
 3   became_member_on  17000 non-null  int64
 4   income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB
df_profile['Date'] = df_profile.became_member_on.apply(lambda x: datetime.datetime.strptime(str(x), '%Y%m%d'))
df_profile['year'] = df_profile.became_member_on.astype(str).apply(lambda x: int((x)[:4]))
df_profile['month'] = df_profile.became_member_on.astype(str).apply(lambda x: int((x)[4:6]))
df_profile['days'] = df_profile.became_member_on.astype(str).apply(lambda x: int((x)[6:]))
df_profile.head()
| | gender | age | id | became_member_on | income | Date | year | month | days |
|---|---|---|---|---|---|---|---|---|---|
| 0 | O | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 | NaN | 2017-02-12 | 2017 | 2 | 12 |
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.0 | 2017-07-15 | 2017 | 7 | 15 |
| 2 | O | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 | NaN | 2018-07-12 | 2018 | 7 | 12 |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.0 | 2017-05-09 | 2017 | 5 | 9 |
| 4 | O | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 | NaN | 2017-08-04 | 2017 | 8 | 4 |
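The same year/month/day extraction can be done with `pd.to_datetime` and the `.dt` accessor instead of slicing the integer as a string. A sketch on two sample membership dates:

```python
import pandas as pd

# Two sample became_member_on values in the data's YYYYMMDD integer format.
dates = pd.DataFrame({'became_member_on': [20170212, 20180726]})

# Parse once, then pull out the date parts with the .dt accessor.
parsed = pd.to_datetime(dates.became_member_on, format='%Y%m%d')
print(parsed.dt.year.tolist(), parsed.dt.month.tolist(), parsed.dt.day.tolist())
```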
#Counts of gender.
df_profile.gender.value_counts()
M    8484
F    6129
O    2387
Name: gender, dtype: int64
#Counts of age.
df_profile.age.value_counts()
118 2175
58 408
53 372
51 363
54 359
...
100 12
96 8
98 5
99 5
101 5
Name: age, Length: 85, dtype: int64
#Values of income.
df_profile.income.value_counts()
73000.0 314
72000.0 297
71000.0 294
57000.0 288
74000.0 282
...
116000.0 46
112000.0 45
107000.0 45
117000.0 32
120000.0 13
Name: income, Length: 91, dtype: int64
#Counts of year.
df_profile.year.value_counts()
2017    6469
2018    4198
2016    3526
2015    1830
2014     691
2013     286
Name: year, dtype: int64
#Counts of month.
df_profile.month.value_counts().sort_index(ascending=True)
1     1525
2     1202
3     1329
4     1315
5     1307
6     1265
7     1359
8     1610
9     1515
10    1568
11    1449
12    1556
Name: month, dtype: int64
# Select rows where gender is female or male.
df_profile[(df_profile['gender'] == 'F') | (df_profile['gender'] == 'M')].head()
| | gender | age | id | became_member_on | income | Date | year | month | days |
|---|---|---|---|---|---|---|---|---|---|
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.0 | 2017-07-15 | 2017 | 7 | 15 |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.0 | 2017-05-09 | 2017 | 5 | 9 |
| 5 | M | 68 | e2127556f4f64592b11af22de27a7932 | 20180426 | 70000.0 | 2018-04-26 | 2018 | 4 | 26 |
| 8 | M | 65 | 389bc3fa690240e798340f5a15918d5c | 20180209 | 53000.0 | 2018-02-09 | 2018 | 2 | 9 |
| 12 | M | 58 | 2eeac8d8feae4a8cad5a6af0499a211d | 20171111 | 51000.0 | 2017-11-11 | 2017 | 11 | 11 |
#Check for unspecified values.
df_profile[(df_profile['gender'] != 'F') & (df_profile['gender'] != 'M')].head()
| | gender | age | id | became_member_on | income | Date | year | month | days |
|---|---|---|---|---|---|---|---|---|---|
| 0 | O | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 | NaN | 2017-02-12 | 2017 | 2 | 12 |
| 2 | O | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 | NaN | 2018-07-12 | 2018 | 7 | 12 |
| 4 | O | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 | NaN | 2017-08-04 | 2017 | 8 | 4 |
| 6 | O | 118 | 8ec6ce2a7e7949b1bf142def7d0e0586 | 20170925 | NaN | 2017-09-25 | 2017 | 9 | 25 |
| 7 | O | 118 | 68617ca6246f4fbc85e91a2a49552598 | 20171002 | NaN | 2017-10-02 | 2017 | 10 | 2 |
According to the result above, the rows with unspecified gender have age 118 (an apparent placeholder) and a null income.
#The number of times the gender could not be determined.
df_profile[(df_profile['gender'] != 'F') & (df_profile['gender'] != 'M')]['gender'].value_counts()[0]
2387
#Number of distinct ages among customers whose gender could not be determined.
df_profile[(df_profile['gender'] != 'F') & (df_profile['gender'] != 'M')]['age'].value_counts().count()
67
Income is missing for the customers whose gender could not be determined.
#Number of customers whose income could not be determined.
df_profile['income'].isnull().value_counts()[1]
2175
df_profile[(df_profile['gender'] != 'F') & (df_profile['gender'] != 'M')]['income'].isnull().value_counts()
True     2175
False     212
Name: income, dtype: int64
df_profile[(df_profile['gender'] == 'F') | (df_profile['gender'] == 'M')]['income'].isnull().value_counts()
False 14613 Name: income, dtype: int64
According to the above results, all of the missing income values belong to rows where the gender could not be determined. Therefore, we consider removing the rows with gender "O" (unidentifiable).
#Clean the data by removing rows with gender "O".
df_profile = df_profile[(df_profile['gender'] == 'F') | (df_profile['gender'] == 'M')]
df_profile.head()
| | gender | age | id | became_member_on | income | Date | year | month | days |
|---|---|---|---|---|---|---|---|---|---|
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.0 | 2017-07-15 | 2017 | 7 | 15 |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.0 | 2017-05-09 | 2017 | 5 | 9 |
| 5 | M | 68 | e2127556f4f64592b11af22de27a7932 | 20180426 | 70000.0 | 2018-04-26 | 2018 | 4 | 26 |
| 8 | M | 65 | 389bc3fa690240e798340f5a15918d5c | 20180209 | 53000.0 | 2018-02-09 | 2018 | 2 | 9 |
| 12 | M | 58 | 2eeac8d8feae4a8cad5a6af0499a211d | 20171111 | 51000.0 | 2017-11-11 | 2017 | 11 | 11 |
#Sort by membership date and replace the integer became_member_on with the parsed Date.
df_profile = df_profile.sort_values(by=['Date'], ascending=False)
df_profile = df_profile.drop(labels='became_member_on', axis=1)
df_profile = df_profile.rename(columns={'Date': 'became_member_on'})
df_profile["val"] = 1
#Check the data.
df_profile.head()
| | gender | age | id | income | became_member_on | year | month | days | val |
|---|---|---|---|---|---|---|---|---|---|
| 10282 | M | 46 | e4989ca3d8974e28a3fe87aa0b0ce327 | 74000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 |
| 14757 | M | 59 | 3d50b3815c8e4f3b84f46be4d1b64cd3 | 34000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 |
| 12902 | M | 85 | f1f01be4e2344bf08b7f488c115ad374 | 45000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 |
| 4357 | M | 76 | 6597e5ab63754726b83ba2032f008d26 | 92000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 |
| 13249 | F | 45 | 42f00b6b150e45149c50db9bd642bca6 | 70000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 |
#Group by year.
df_profile[['age', 'year', 'income','val']].groupby(['year']).sum()
| | age | income | val |
|---|---|---|---|
| year | | | |
| 2013 | 14207 | 15597000.0 | 271 |
| 2014 | 33754 | 38296000.0 | 658 |
| 2015 | 85799 | 105177000.0 | 1568 |
| 2016 | 169872 | 211236000.0 | 2992 |
| 2017 | 298867 | 362194000.0 | 5510 |
| 2018 | 192352 | 223712000.0 | 3614 |
#age and income by gender and year.
df_profile[['gender','age', 'year', 'income', 'val']].groupby(['gender','year']).sum()
| | | age | income | val |
|---|---|---|---|---|
| gender | year | | | |
| F | 2013 | 4202 | 4567000.0 | 74 |
| | 2014 | 7343 | 8468000.0 | 138 |
| | 2015 | 38083 | 47701000.0 | 663 |
| | 2016 | 88385 | 112994000.0 | 1509 |
| | 2017 | 140503 | 174463000.0 | 2443 |
| | 2018 | 74177 | 88844000.0 | 1302 |
| M | 2013 | 10005 | 11030000.0 | 197 |
| | 2014 | 26411 | 29828000.0 | 520 |
| | 2015 | 47716 | 57476000.0 | 905 |
| | 2016 | 81487 | 98242000.0 | 1483 |
| | 2017 | 158364 | 187731000.0 | 3067 |
| | 2018 | 118175 | 134868000.0 | 2312 |
#Data of age.
df_profile.age
10282 46
14757 59
12902 85
4357 76
13249 45
..
10933 67
1200 52
7796 58
569 36
8948 64
Name: age, Length: 14613, dtype: int64
#Bin ages into ranges of ten years.
range_age = []
for val in df_profile.age:
    if val >= 100:
        cat = 10
    elif val >= 90:
        cat = 9
    elif val >= 80:
        cat = 8
    elif val >= 70:
        cat = 7
    elif val >= 60:
        cat = 6
    elif val >= 50:
        cat = 5
    elif val >= 40:
        cat = 4
    elif val >= 30:
        cat = 3
    elif val >= 20:
        cat = 2
    elif val >= 10:
        cat = 1
    else:
        cat = 0
    range_age.append(cat)
len(range_age)
14613
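The if/elif chain above is equivalent to integer-dividing the age by 10 and capping the result at 10, so it can be vectorized in one line. A sketch on a few sample ages:

```python
import pandas as pd

# Sample ages spanning the bins, including the 118 placeholder value.
ages = pd.Series([18, 45, 59, 101, 118])

# Decade bin = age // 10, capped at 10 so everyone aged 100+ shares one bin.
bins = (ages // 10).clip(upper=10)
print(bins.tolist())
```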
#Add the range_of_age column to df_profile.
df_profile['range_of_age'] = range_age
#Labels for each age range.
age_range = {0:'Age: 0-9',
1:'Age: 10-19',
2:'Age: 20-29',
3:'Age: 30-39',
4:'Age: 40-49',
5:'Age: 50-59',
6:'Age: 60-69',
7:'Age: 70-79',
8:'Age: 80-89',
9:'Age: 90-99',
10:'Age: >=100'}
#Check the data.
df_profile.head()
| | gender | age | id | income | became_member_on | year | month | days | val | range_of_age |
|---|---|---|---|---|---|---|---|---|---|---|
| 10282 | M | 46 | e4989ca3d8974e28a3fe87aa0b0ce327 | 74000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 | 4 |
| 14757 | M | 59 | 3d50b3815c8e4f3b84f46be4d1b64cd3 | 34000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 | 5 |
| 12902 | M | 85 | f1f01be4e2344bf08b7f488c115ad374 | 45000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 | 8 |
| 4357 | M | 76 | 6597e5ab63754726b83ba2032f008d26 | 92000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 | 7 |
| 13249 | F | 45 | 42f00b6b150e45149c50db9bd642bca6 | 70000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 | 4 |
df_profile.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14613 entries, 10282 to 8948
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   gender            14613 non-null  object
 1   age               14613 non-null  int64
 2   id                14613 non-null  object
 3   income            14613 non-null  float64
 4   became_member_on  14613 non-null  datetime64[ns]
 5   year              14613 non-null  int64
 6   month             14613 non-null  int64
 7   days              14613 non-null  int64
 8   val               14613 non-null  int64
 9   range_of_age      14613 non-null  int64
dtypes: datetime64[ns](1), float64(1), int64(6), object(2)
memory usage: 1.2+ MB
#Consumer income by age and gender.
df_profile[['range_of_age','gender', 'income', 'val']].groupby(['range_of_age','gender']).sum().sort_index()
| | | income | val |
|---|---|---|---|
| range_of_age | gender | | |
| 1 | F | 2388000.0 | 45 |
| | M | 7962000.0 | 160 |
| 2 | F | 20372000.0 | 393 |
| | M | 48819000.0 | 960 |
| 3 | F | 27136000.0 | 495 |
| | M | 53498000.0 | 1008 |
| 4 | F | 54590000.0 | 835 |
| | M | 85121000.0 | 1434 |
| 5 | F | 119579000.0 | 1560 |
| | M | 127330000.0 | 1925 |
| 6 | F | 102946000.0 | 1350 |
| | M | 103758000.0 | 1602 |
| 7 | F | 63568000.0 | 832 |
| | M | 61054000.0 | 925 |
| 8 | F | 34019000.0 | 450 |
| | M | 25181000.0 | 371 |
| 9 | F | 11413000.0 | 155 |
| | M | 6291000.0 | 97 |
| 10 | F | 1026000.0 | 14 |
| | M | 161000.0 | 2 |
#Calculation
df_profile[['range_of_age','gender', 'income']].groupby(['gender','range_of_age']).describe()
Income statistics by gender and age range:

| | | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| gender | range_of_age | | | | | | | | |
| F | 1 | 45.0 | 53066.666667 | 12393.913462 | 31000.0 | 42000.0 | 55000.0 | 62000.0 | 73000.0 |
| | 2 | 393.0 | 51837.150127 | 12750.107803 | 30000.0 | 41000.0 | 52000.0 | 62000.0 | 75000.0 |
| | 3 | 495.0 | 54820.202020 | 15663.141575 | 30000.0 | 42000.0 | 53000.0 | 66000.0 | 100000.0 |
| | 4 | 835.0 | 65377.245509 | 19016.658344 | 30000.0 | 53000.0 | 64000.0 | 78000.0 | 119000.0 |
| | 5 | 1560.0 | 76653.205128 | 21942.156507 | 30000.0 | 59000.0 | 77000.0 | 93000.0 | 120000.0 |
| | 6 | 1350.0 | 76256.296296 | 21957.694371 | 30000.0 | 60000.0 | 76000.0 | 93000.0 | 120000.0 |
| | 7 | 832.0 | 76403.846154 | 22446.682588 | 30000.0 | 60000.0 | 77000.0 | 93000.0 | 119000.0 |
| | 8 | 450.0 | 75597.777778 | 22548.349328 | 31000.0 | 60000.0 | 75000.0 | 92000.0 | 120000.0 |
| | 9 | 155.0 | 73632.258065 | 21707.326956 | 31000.0 | 57000.0 | 74000.0 | 89000.0 | 119000.0 |
| | 10 | 14.0 | 73285.714286 | 24408.159328 | 42000.0 | 56750.0 | 70500.0 | 92500.0 | 118000.0 |
| M | 1 | 160.0 | 49762.500000 | 12653.330036 | 30000.0 | 39000.0 | 49000.0 | 59000.0 | 74000.0 |
| | 2 | 960.0 | 50853.125000 | 13194.079099 | 30000.0 | 39000.0 | 50000.0 | 62000.0 | 75000.0 |
| | 3 | 1008.0 | 53073.412698 | 14791.989279 | 30000.0 | 41000.0 | 52000.0 | 64000.0 | 100000.0 |
| | 4 | 1434.0 | 59359.135286 | 17904.609716 | 30000.0 | 45000.0 | 58000.0 | 71000.0 | 119000.0 |
| | 5 | 1925.0 | 66145.454545 | 21157.749163 | 30000.0 | 50000.0 | 65000.0 | 81000.0 | 120000.0 |
| | 6 | 1602.0 | 64767.790262 | 21398.973426 | 30000.0 | 49000.0 | 63000.0 | 79000.0 | 120000.0 |
| | 7 | 925.0 | 66004.324324 | 21666.549684 | 30000.0 | 50000.0 | 64000.0 | 81000.0 | 119000.0 |
| | 8 | 371.0 | 67873.315364 | 21250.694477 | 30000.0 | 52000.0 | 67000.0 | 81500.0 | 120000.0 |
| | 9 | 97.0 | 64855.670103 | 21965.877748 | 30000.0 | 51000.0 | 60000.0 | 81000.0 | 120000.0 |
| | 10 | 2.0 | 80500.000000 | 24748.737342 | 63000.0 | 71750.0 | 80500.0 | 89250.0 | 98000.0 |
#Popularity of using Starbucks each year.
fig = px.histogram(data_frame=df_profile, x="year", y='income', color="year", pattern_shape="gender")
fig.update_layout(title_text='Relation of year and income.', bargap=0.1)
fig.show()
#Range of age and income.
fig = px.histogram(data_frame=df_profile, x="range_of_age", y='income')
fig.update_layout(title_text='Relation of age range and income.', bargap=0.1)
fig.show()
#Relation of year, age and gender for Starbucks.
fig = px.strip(df_profile, x="age", y="year", orientation="h", color="gender")
fig.show()
#Relation of range_of_age, year, income and gender for Starbucks.
fig = px.scatter_matrix(df_profile, dimensions=["range_of_age", "year", "income"], color="gender")
fig.show()
# Copy transcript into df_transcript.
df_transcript = transcript.copy()
df_transcript.head()
| | person | event | value | time |
|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} | 0 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} | 0 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | {'offer id': '2906b810c7d4411798c6938adc9daaa5'} | 0 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} | 0 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} | 0 |
#Get keys of value.
df_transcript.value.apply(lambda x: list(x.keys())[0])
0 offer id
1 offer id
2 offer id
3 offer id
4 offer id
...
306529 amount
306530 amount
306531 amount
306532 amount
306533 amount
Name: value, Length: 306534, dtype: object
#Get values of value.
df_transcript.value.apply(lambda x: list(x.values())[0])
0 9b98b8c7a33c4b65b9aebfe6a799e6d9
1 0b1e1539f2cc45b7b9fa7c272da2e1d7
2 2906b810c7d4411798c6938adc9daaa5
3 fafdcd668e3743c1bb461111dcafc2a4
4 4d5c57ea9a6940dd891ad53e9dbe8da0
...
306529 1.59
306530 9.53
306531 3.61
306532 3.53
306533 4.05
Name: value, Length: 306534, dtype: object
#Add value_keys and values from value in df_transcript.
df_transcript['value_keys'] = df_transcript.value.apply(lambda x: list(x.keys())[0])
df_transcript['values'] = df_transcript.value.apply(lambda x: list(x.values())[0])
df_transcript.head()
| person | event | value | time | value_keys | values | |
|---|---|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} | 0 | offer id | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} | 0 | offer id | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | {'offer id': '2906b810c7d4411798c6938adc9daaa5'} | 0 | offer id | 2906b810c7d4411798c6938adc9daaa5 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} | 0 | offer id | fafdcd668e3743c1bb461111dcafc2a4 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} | 0 | offer id | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
#Drop the value column.
df_transcript = df_transcript.drop(labels='value', axis=1)
df_transcript.head()
| person | event | time | value_keys | values | |
|---|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | 0 | offer id | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | 0 | offer id | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | 0 | offer id | 2906b810c7d4411798c6938adc9daaa5 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | 0 | offer id | fafdcd668e3743c1bb461111dcafc2a4 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | 0 | offer id | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
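One caveat with this extraction: the transcript dictionaries use both the spelling 'offer id' (received/viewed events) and 'offer_id' (completed events), so the extracted keys are inconsistent. A minimal sketch, on toy rows, of normalizing the spelling after extraction (the ids below are copied from the real data purely for illustration):

```python
import pandas as pd

# Toy rows mimicking the three key spellings seen in the real transcript.
toy = pd.DataFrame({
    "value": [
        {"offer id": "9b98b8c7a33c4b65b9aebfe6a799e6d9"},
        {"offer_id": "fafdcd668e3743c1bb461111dcafc2a4"},
        {"amount": 1.59},
    ]
})

# Extract the first key/value of each dict, then unify 'offer id' and 'offer_id'.
toy["value_keys"] = toy["value"].apply(lambda d: list(d.keys())[0])
toy["values"] = toy["value"].apply(lambda d: list(d.values())[0])
toy["value_keys"] = toy["value_keys"].str.replace(" ", "_")

print(toy["value_keys"].tolist())  # ['offer_id', 'offer_id', 'amount']
```

Without this step, later filters on `value_keys == 'offer id'` silently miss the completed-offer rows.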
#Get columns.
df_transcript.columns
Index(['person', 'event', 'time', 'value_keys', 'values'], dtype='object')
#The number of values in df_transcript.
df_transcript.shape
(306534, 5)
#The number of each person.
df_transcript.person.value_counts()
94de646f7b6041228ca7dec82adb97d2 51
8dbfa485249f409aa223a2130f40634a 49
d0a80415b84c4df4908b8403b19765e3 48
5e60c6aa3b834e44b822ea43a3efea26 48
79d9d4f86aca4bed9290350fb43817c2 48
..
fccc9279ba56411f80ffe8ce7e0935cd 2
df9fc9a86ca84ef5aedde8925d5838ba 2
cae5e211053f4121a389a7da4d631f7f 2
e63e42480aae4ede9f07cac49c8c3f78 2
da7a7c0dcfcb41a8acc7864a53cf60fb 1
Name: person, Length: 17000, dtype: int64
#The number of each event.
df_transcript.event.value_counts()
transaction        138953
offer received      76277
offer viewed        57725
offer completed     33579
Name: event, dtype: int64
#Transactions as a percentage of offer-related events.
(df_transcript.event == 'transaction').sum() / (df_transcript.event.value_counts().sum() - (df_transcript.event == 'transaction').sum())*100
82.91691778900949
#Percentage of offer events that were viewed.
x = df_transcript.event.value_counts().sum() - (df_transcript.event == 'transaction').sum()
((df_transcript.event == 'offer viewed').sum() / x ) * 100
34.44602908444275
#Percentage of offer events that were completed.
x = df_transcript.event.value_counts().sum() - (df_transcript.event == 'transaction').sum()
print(x)
print('%.2f' % (((df_transcript.event == 'offer completed').sum() / x) * 100))
167581
20.04
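These share calculations can be checked in one place; the counts below are hard-coded from the value_counts output above, and the helper names are purely illustrative:

```python
import pandas as pd

# Event counts copied from df_transcript.event.value_counts() above.
counts = pd.Series({
    "transaction": 138953,
    "offer received": 76277,
    "offer viewed": 57725,
    "offer completed": 33579,
})

# Offer-related events are everything except raw transactions.
offer_events = counts.drop("transaction").sum()                  # 167581
viewed_pct = counts["offer viewed"] / offer_events * 100         # ~34.45
completed_pct = counts["offer completed"] / offer_events * 100   # ~20.04

print(offer_events, round(viewed_pct, 2), round(completed_pct, 2))
```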
#The number of each time.
df_transcript.time.value_counts()
408 17030
576 17015
504 16822
336 16302
168 16150
...
318 940
330 938
156 914
162 910
150 894
Name: time, Length: 120, dtype: int64
#The number of each value_keys.
df_transcript.value_keys.value_counts()
amount      138953
offer id    134002
offer_id     33579
Name: value_keys, dtype: int64
#The number of each values.
df_transcript['values'].value_counts()
fafdcd668e3743c1bb461111dcafc2a4 20241
2298d6c36e964ae4a3e7e9706d1fb8c2 20139
f19421c1d4aa40978ebb69ca19b0e20d 19131
4d5c57ea9a6940dd891ad53e9dbe8da0 18222
ae264e3637204a6fb9bb56bc8210ddfd 18062
...
380.24 1
378.35 1
138.36 1
87.71 1
43.29 1
Name: values, Length: 5113, dtype: int64
#Counts of each (event, values) pair; offer ids for offer events, amounts for transactions.
df_transcript[['event', 'values']].value_counts()
event values
offer received 9b98b8c7a33c4b65b9aebfe6a799e6d9 7677
0b1e1539f2cc45b7b9fa7c272da2e1d7 7668
ae264e3637204a6fb9bb56bc8210ddfd 7658
2298d6c36e964ae4a3e7e9706d1fb8c2 7646
2906b810c7d4411798c6938adc9daaa5 7632
...
transaction 49.2 1
49.23 1
49.38 1
49.48 1
1062.28 1
Length: 5131, dtype: int64
df_transcript.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   person      306534 non-null  object
 1   event       306534 non-null  object
 2   time        306534 non-null  int64
 3   value_keys  306534 non-null  object
 4   values      306534 non-null  object
dtypes: int64(1), object(4)
memory usage: 11.7+ MB
#Check for missing data.
df_transcript.isnull().value_counts()
person  event  time   value_keys  values
False   False  False  False       False     306534
dtype: int64
#Select transcript_event: rows where event is 'transaction'.
transcript_event = df_transcript[df_transcript['event'] == 'transaction']
transcript_event.head()
| person | event | time | value_keys | values | |
|---|---|---|---|---|---|
| 12654 | 02c083884c7d45b39cc68e1314fec56c | transaction | 0 | amount | 0.83 |
| 12657 | 9fa9ae8f57894cc9a3b8a9bbe0fc1b2f | transaction | 0 | amount | 34.56 |
| 12659 | 54890f68699049c2a04d415abc25e717 | transaction | 0 | amount | 13.23 |
| 12670 | b2f1cd155b864803ad8334cdf13c4bd2 | transaction | 0 | amount | 19.51 |
| 12671 | fe97aa22dd3e48c8b143116a8403dd52 | transaction | 0 | amount | 18.97 |
transcript_event.shape
(138953, 5)
#Select transcript_value_keys: rows where value_keys is 'amount'.
transcript_value_keys = df_transcript[df_transcript['value_keys'] == 'amount']
transcript_value_keys.head()
| person | event | time | value_keys | values | |
|---|---|---|---|---|---|
| 12654 | 02c083884c7d45b39cc68e1314fec56c | transaction | 0 | amount | 0.83 |
| 12657 | 9fa9ae8f57894cc9a3b8a9bbe0fc1b2f | transaction | 0 | amount | 34.56 |
| 12659 | 54890f68699049c2a04d415abc25e717 | transaction | 0 | amount | 13.23 |
| 12670 | b2f1cd155b864803ad8334cdf13c4bd2 | transaction | 0 | amount | 19.51 |
| 12671 | fe97aa22dd3e48c8b143116a8403dd52 | transaction | 0 | amount | 18.97 |
transcript_value_keys.shape
(138953, 5)
transcript_event and transcript_value_keys have the same shape: every 'transaction' event carries an 'amount' key, and vice versa.
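Rather than comparing shapes by eye, the claim that the two filters select exactly the same rows can be checked directly with `DataFrame.equals`; a minimal sketch on toy data:

```python
import pandas as pd

# Minimal stand-in for df_transcript: every 'transaction' row carries 'amount'.
df = pd.DataFrame({
    "event": ["transaction", "offer received", "transaction"],
    "value_keys": ["amount", "offer id", "amount"],
})

by_event = df[df["event"] == "transaction"]
by_key = df[df["value_keys"] == "amount"]

# .equals() confirms the two filters select exactly the same rows.
print(by_event.equals(by_key))  # True
```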
# The number of each event.
plt.figure(figsize=(10,5))
df_transcript.event.value_counts().plot(kind='bar');
plt.ylabel('Sum of event.', fontsize = 12)
plt.xlabel('Event.', fontsize = 12)
plt.title('Amount of event', fontsize = 15);
# Check df_portfolio (all 10 offers).
df_portfolio
| reward | channels | difficulty | duration | offer_type | id | email | web | social | mobile | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | [email, mobile, social] | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| 1 | 10 | [web, email, mobile, social] | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 | 1 | 1 | 1 | 1 |
| 2 | 0 | [web, email, mobile] | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed | 1 | 1 | 0 | 1 |
| 3 | 5 | [web, email, mobile] | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 1 | 1 | 0 | 1 |
| 4 | 5 | [web, email] | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 1 | 1 | 0 | 0 |
| 5 | 3 | [web, email, mobile, social] | 7 | 7 | discount | 2298d6c36e964ae4a3e7e9706d1fb8c2 | 1 | 1 | 1 | 1 |
| 6 | 2 | [web, email, mobile, social] | 10 | 10 | discount | fafdcd668e3743c1bb461111dcafc2a4 | 1 | 1 | 1 | 1 |
| 7 | 0 | [email, mobile, social] | 0 | 3 | informational | 5a8bc65990b245e5a138643cd4eb9837 | 1 | 0 | 1 | 1 |
| 8 | 5 | [web, email, mobile, social] | 5 | 5 | bogo | f19421c1d4aa40978ebb69ca19b0e20d | 1 | 1 | 1 | 1 |
| 9 | 2 | [web, email, mobile] | 10 | 7 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 |
# Check the first 5 lines of df_profile.
df_profile.head()
| gender | age | id | income | became_member_on | year | month | days | val | range_of_age | |
|---|---|---|---|---|---|---|---|---|---|---|
| 10282 | M | 46 | e4989ca3d8974e28a3fe87aa0b0ce327 | 74000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 | 4 |
| 14757 | M | 59 | 3d50b3815c8e4f3b84f46be4d1b64cd3 | 34000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 | 5 |
| 12902 | M | 85 | f1f01be4e2344bf08b7f488c115ad374 | 45000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 | 8 |
| 4357 | M | 76 | 6597e5ab63754726b83ba2032f008d26 | 92000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 | 7 |
| 13249 | F | 45 | 42f00b6b150e45149c50db9bd642bca6 | 70000.0 | 2018-07-26 | 2018 | 7 | 26 | 1 | 4 |
# Check the first 5 lines of df_transcript.
df_transcript.head()
| person | event | time | value_keys | values | |
|---|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | 0 | offer id | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | 0 | offer id | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | 0 | offer id | 2906b810c7d4411798c6938adc9daaa5 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | 0 | offer id | fafdcd668e3743c1bb461111dcafc2a4 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | 0 | offer id | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
Shapes of the three datasets.
print(df_portfolio.shape)
print(df_profile.shape)
print(df_transcript.shape)
(10, 10)
(14613, 10)
(306534, 5)
# Merge df_transcript with df_profile.
df_trans_prof = df_transcript.merge(df_profile, left_on='person', right_on='id', how='right')
df_trans_prof = df_trans_prof.drop(columns=['month', 'days', 'id'])
df_trans_prof.head()
| person | event | time | value_keys | values | gender | age | income | became_member_on | year | val | range_of_age | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e4989ca3d8974e28a3fe87aa0b0ce327 | offer received | 0 | offer id | 3f207df678b143eea3cee63160fa8bed | M | 46 | 74000.0 | 2018-07-26 | 2018 | 1 | 4 |
| 1 | e4989ca3d8974e28a3fe87aa0b0ce327 | offer viewed | 6 | offer id | 3f207df678b143eea3cee63160fa8bed | M | 46 | 74000.0 | 2018-07-26 | 2018 | 1 | 4 |
| 2 | e4989ca3d8974e28a3fe87aa0b0ce327 | transaction | 78 | amount | 11.0 | M | 46 | 74000.0 | 2018-07-26 | 2018 | 1 | 4 |
| 3 | e4989ca3d8974e28a3fe87aa0b0ce327 | transaction | 102 | amount | 29.2 | M | 46 | 74000.0 | 2018-07-26 | 2018 | 1 | 4 |
| 4 | e4989ca3d8974e28a3fe87aa0b0ce327 | offer received | 168 | offer id | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | M | 46 | 74000.0 | 2018-07-26 | 2018 | 1 | 4 |
# Events carrying an amount (transactions only).
df_trans_prof[df_trans_prof['value_keys'] == 'amount']['event'].value_counts()
transaction    122176
Name: event, dtype: int64
#Transactions that do not carry an amount (there are none).
df_trans_prof[(df_trans_prof['event'] == 'transaction') & (df_trans_prof['value_keys'] != 'amount')]['event'].value_counts()
Series([], Name: event, dtype: int64)
#Offer-related events (value holds an offer id).
df_trans_prof[df_trans_prof['value_keys'] != 'amount']['event'].value_counts()
offer received     65585
offer viewed       49087
offer completed    31943
Name: event, dtype: int64
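These counts are lower than in the raw transcript (e.g. 65,585 received vs 76,277) because the right merge kept only users present in the cleaned df_profile. A merge with `indicator=True` can audit such row loss; a toy sketch with one hypothetical unprofiled user:

```python
import pandas as pd

# Toy versions of df_transcript and df_profile; person "b" has no profile row,
# mirroring how the notebook's right merge silently drops unprofiled users.
transcript = pd.DataFrame({"person": ["a", "a", "b"],
                           "event": ["transaction"] * 3})
profile = pd.DataFrame({"id": ["a"], "income": [74000.0]})

# An outer merge with indicator surfaces rows that would have been dropped.
merged = transcript.merge(profile, left_on="person", right_on="id",
                          how="outer", indicator=True)
unmatched = (merged["_merge"] == "left_only").sum()
print(unmatched)  # 1 event belongs to a person with no profile row
```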
# Frequency of use of services by range_of_age.
df_trans_prof.range_of_age.value_counts()
5     62749
6     53146
4     41713
7     31037
3     29794
2     26669
8     14991
9      4417
1      3962
10      313
Name: range_of_age, dtype: int64
# Sum of age and income by event and gender.
df_trans_prof.groupby(['event','gender'])[['age','income']].sum()
| age | income | ||
|---|---|---|---|
| event | gender | ||
| offer completed | F | 897242 | 1.134670e+09 |
| M | 886192 | 1.085293e+09 | |
| offer received | F | 1577891 | 1.955927e+09 |
| M | 1987588 | 2.333494e+09 | |
| offer viewed | F | 1194201 | 1.475803e+09 |
| M | 1487631 | 1.756511e+09 | |
| transaction | F | 2747515 | 3.299407e+09 |
| M | 3692523 | 4.257817e+09 |
# Check for null income values in df_trans_prof.
df_trans_prof[df_trans_prof.income.isnull()].value_counts()
Series([], dtype: int64)
Next, identify which offer each person received.
# Merge df_transcript with df_portfolio.
df_trans_sum = df_transcript.merge(df_portfolio, left_on='values', right_on='id', how='right')
df_trans_sum = df_trans_sum.drop(columns=['channels', 'values'])
df_trans_sum.head()
| person | event | time | value_keys | reward | difficulty | duration | offer_type | id | email | web | social | mobile | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4b0da7e80e5945209a1fdddfe813dbe0 | offer received | 0 | offer id | 10 | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| 1 | 1e9420836d554513ab90eba98552d0a9 | offer received | 0 | offer id | 10 | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| 2 | 02c083884c7d45b39cc68e1314fec56c | offer received | 0 | offer id | 10 | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| 3 | 676506bad68e4161b9bbaffeb039626b | offer received | 0 | offer id | 10 | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| 4 | fe8264108d5b4f198453bbb1fa7ca6c9 | offer received | 0 | offer id | 10 | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
# Popularity of offer types.
df_trans_sum.offer_type.value_counts()
bogo             71617
discount         69898
informational    26066
Name: offer_type, dtype: int64
df_trans_sum.groupby(['offer_type','id'])[['event']].count()
| event | ||
|---|---|---|
| offer_type | id | |
| bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 | 18222 |
| 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 16202 | |
| ae264e3637204a6fb9bb56bc8210ddfd | 18062 | |
| f19421c1d4aa40978ebb69ca19b0e20d | 19131 | |
| discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 13751 |
| 2298d6c36e964ae4a3e7e9706d1fb8c2 | 20139 | |
| 2906b810c7d4411798c6938adc9daaa5 | 15767 | |
| fafdcd668e3743c1bb461111dcafc2a4 | 20241 | |
| informational | 3f207df678b143eea3cee63160fa8bed | 11761 |
| 5a8bc65990b245e5a138643cd4eb9837 | 14305 |
df_trans_sum.groupby(['offer_type','id'])[['email','mobile','web','social']].sum()
| email | mobile | web | social | ||
|---|---|---|---|---|---|
| offer_type | id | ||||
| bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 | 18222 | 18222 | 18222 | 18222 |
| 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 16202 | 16202 | 16202 | 0 | |
| ae264e3637204a6fb9bb56bc8210ddfd | 18062 | 18062 | 0 | 18062 | |
| f19421c1d4aa40978ebb69ca19b0e20d | 19131 | 19131 | 19131 | 19131 | |
| discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 13751 | 0 | 13751 | 0 |
| 2298d6c36e964ae4a3e7e9706d1fb8c2 | 20139 | 20139 | 20139 | 20139 | |
| 2906b810c7d4411798c6938adc9daaa5 | 15767 | 15767 | 15767 | 0 | |
| fafdcd668e3743c1bb461111dcafc2a4 | 20241 | 20241 | 20241 | 20241 | |
| informational | 3f207df678b143eea3cee63160fa8bed | 11761 | 11761 | 11761 | 0 |
| 5a8bc65990b245e5a138643cd4eb9837 | 14305 | 14305 | 0 | 14305 |
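The email/web/social/mobile flag columns above were presumably derived from the `channels` lists in the portfolio. One way to build such flags, sketched on hypothetical offer ids (not the project's real frames):

```python
import pandas as pd

# Hypothetical portfolio rows; the real df_portfolio stores channels as lists.
portfolio = pd.DataFrame({
    "id": ["offer_a", "offer_b"],
    "channels": [["email", "mobile", "social"], ["web", "email"]],
})

# Explode the lists to one row per (offer, channel), mark presence,
# then pivot back to one row per offer with a 0/1 flag per channel.
flags = (portfolio.explode("channels")
         .assign(present=1)
         .pivot_table(index="id", columns="channels",
                      values="present", fill_value=0))
print(flags)
```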
Because we are running promotions and want to see which kinds of offers customers are interested in, we focus on the "offer completed" events and model which kind of offer was completed. The task is limited to "BOGO" and "DISCOUNT" offers, with the following requirements:
Modeling uses the following algorithms from the sklearn library:
#Keep only the columns of interest.
df_trans_prof_offer = df_trans_prof.merge(df_portfolio, left_on='values', right_on='id', how='right')
df_trans_prof_offer = df_trans_prof_offer.drop(columns=['channels', 'values', 'became_member_on', 'val',
                                                        'time', 'income', 'age', 'difficulty',
                                                        'duration', 'gender', 'reward', 'person'])
df_trans_prof_offer = df_trans_prof_offer[df_trans_prof_offer.event == 'offer completed']
df_trans_prof_offer
| event | value_keys | year | range_of_age | offer_type | id | email | web | social | mobile | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | offer completed | offer_id | 2018 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| 15 | offer completed | offer_id | 2018 | 2 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| 17 | offer completed | offer_id | 2018 | 5 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| 20 | offer completed | offer_id | 2018 | 6 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| 24 | offer completed | offer_id | 2018 | 5 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 146603 | offer completed | offer_id | 2013 | 3 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 |
| 146607 | offer completed | offer_id | 2013 | 6 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 |
| 146609 | offer completed | offer_id | 2013 | 2 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 |
| 146612 | offer completed | offer_id | 2013 | 6 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 |
| 146614 | offer completed | offer_id | 2013 | 5 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 |
31943 rows × 10 columns
#Check offer_type within offer completed events.
df_trans_prof_offer['offer_type'].value_counts()
discount    16930
bogo        15013
Name: offer_type, dtype: int64
df_trans_prof_offer.shape[0]
31943
#Create a "offer_type_code" column for keep the type by 1(gobo) and 0(discount)
df_trans_prof_offer['offer_type_code'] = 0
for i in range(df_trans_prof_offer.shape[0]):
if df_trans_prof_offer.offer_type.iloc[i] == "bogo":
df_trans_prof_offer['offer_type_code'].iloc[i] = 1
else:
df_trans_prof_offer['offer_type_code'].iloc[i] = 0
df_trans_prof_offer
| event | value_keys | year | range_of_age | offer_type | id | email | web | social | mobile | offer_type_code | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | offer completed | offer_id | 2018 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 | 1 |
| 15 | offer completed | offer_id | 2018 | 2 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 | 1 |
| 17 | offer completed | offer_id | 2018 | 5 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 | 1 |
| 20 | offer completed | offer_id | 2018 | 6 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 | 1 |
| 24 | offer completed | offer_id | 2018 | 5 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 146603 | offer completed | offer_id | 2013 | 3 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 | 0 |
| 146607 | offer completed | offer_id | 2013 | 6 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 | 0 |
| 146609 | offer completed | offer_id | 2013 | 2 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 | 0 |
| 146612 | offer completed | offer_id | 2013 | 6 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 | 0 |
| 146614 | offer completed | offer_id | 2013 | 5 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 | 0 |
31943 rows × 11 columns
df_process = df_trans_prof_offer.iloc[:, :10]
df_process
| event | value_keys | year | range_of_age | offer_type | id | email | web | social | mobile | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | offer completed | offer_id | 2018 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| 15 | offer completed | offer_id | 2018 | 2 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| 17 | offer completed | offer_id | 2018 | 5 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| 20 | offer completed | offer_id | 2018 | 6 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| 24 | offer completed | offer_id | 2018 | 5 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 146603 | offer completed | offer_id | 2013 | 3 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 |
| 146607 | offer completed | offer_id | 2013 | 6 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 |
| 146609 | offer completed | offer_id | 2013 | 2 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 |
| 146612 | offer completed | offer_id | 2013 | 6 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 |
| 146614 | offer completed | offer_id | 2013 | 5 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 |
31943 rows × 10 columns
#Categorical columns to one-hot encode.
cat_cols = df_process.columns
cat_cols
Index(['event', 'value_keys', 'year', 'range_of_age', 'offer_type', 'id',
'email', 'web', 'social', 'mobile'],
dtype='object')
for col in cat_cols:
df_process = pd.concat([df_process.drop(col, axis=1), pd.get_dummies(df_process[col],
prefix=col,
prefix_sep='_',
drop_first=True,
dummy_na=False)], axis=1)
df_process.head()
| year_2014 | year_2015 | year_2016 | year_2017 | year_2018 | range_of_age_2 | range_of_age_3 | range_of_age_4 | range_of_age_5 | range_of_age_6 | ... | id_2298d6c36e964ae4a3e7e9706d1fb8c2 | id_2906b810c7d4411798c6938adc9daaa5 | id_4d5c57ea9a6940dd891ad53e9dbe8da0 | id_9b98b8c7a33c4b65b9aebfe6a799e6d9 | id_ae264e3637204a6fb9bb56bc8210ddfd | id_f19421c1d4aa40978ebb69ca19b0e20d | id_fafdcd668e3743c1bb461111dcafc2a4 | web_1 | social_1 | mobile_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 15 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 17 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 20 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 24 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
5 rows × 25 columns
df_process.columns
Index(['year_2014', 'year_2015', 'year_2016', 'year_2017', 'year_2018',
'range_of_age_2', 'range_of_age_3', 'range_of_age_4', 'range_of_age_5',
'range_of_age_6', 'range_of_age_7', 'range_of_age_8', 'range_of_age_9',
'range_of_age_10', 'offer_type_discount',
'id_2298d6c36e964ae4a3e7e9706d1fb8c2',
'id_2906b810c7d4411798c6938adc9daaa5',
'id_4d5c57ea9a6940dd891ad53e9dbe8da0',
'id_9b98b8c7a33c4b65b9aebfe6a799e6d9',
'id_ae264e3637204a6fb9bb56bc8210ddfd',
'id_f19421c1d4aa40978ebb69ca19b0e20d',
'id_fafdcd668e3743c1bb461111dcafc2a4', 'web_1', 'social_1', 'mobile_1'],
dtype='object')
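The encoding loop above relies on `drop_first=True`, which keeps k-1 dummy columns per categorical; that is why `offer_type_bogo` is absent from the column list and only `offer_type_discount` appears. A tiny self-contained demo:

```python
import pandas as pd

# With drop_first=True the first category (alphabetically 'bogo') is dropped,
# so a single 0/1 column encodes the two offer types.
df = pd.DataFrame({"offer_type": ["bogo", "discount", "bogo"]})
encoded = pd.get_dummies(df["offer_type"], prefix="offer_type",
                         prefix_sep="_", drop_first=True)
print(list(encoded.columns))  # ['offer_type_discount']
```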
# Set X to the encoded features.
X = df_process
# Set y to the response variable.
y = df_trans_prof_offer['offer_type_code']
# Split into train and test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)
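The split above does not stratify; with this roughly balanced bogo/discount target that matters little, but `stratify=y` is a safer default for classification splits. A small sketch on hypothetical imbalanced data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# With an imbalanced binary target, stratify=y keeps the class ratio
# identical in the train and test splits (hypothetical data).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42, stratify=y)
print(y_te.mean())  # 0.2 — same positive rate as the full data
```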
#LinearRegression (note: the 'normalize' argument was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions, scale features with a StandardScaler pipeline instead).
model_linear = LinearRegression(normalize=True)
#Fit
model_linear.fit(X_train, y_train)
#Predict model.
y_test_preds = model_linear.predict(X_test)
y_train_preds = model_linear.predict(X_train)
#Accuracy (R^2) of the LinearRegression model, scored against the true labels (not its own predictions).
acc = model_linear.score(X_test, y_test)
print('Test Accuracy: %.2f' % (acc*100))
acc = model_linear.score(X_train, y_train)
print('Train Accuracy: %.2f' % (acc*100))
#Loss of model by LinearRegression.
test_score = mean_squared_error(y_test, y_test_preds)
train_score = mean_squared_error(y_train, y_train_preds)
print('mean_squared_error: ' , test_score)
print('mean_squared_error: ' , train_score)
#r2_score of the LinearRegression model.
test_score = r2_score(y_test, y_test_preds)
train_score = r2_score(y_train, y_train_preds)
print('r2_score: ' , test_score)
print('r2_score: ' , train_score)
#Compare data.
print('\ny_test[:10]:', list(y_test[:10]))
print('y_train[:10]:', list(y_train[:10]))
print('y_test_preds_linear[:10]:', y_test_preds[:10])
print('y_train_preds_linear[:10]:', y_train_preds[:10])
Test Accuracy: 100.00
Train Accuracy: 100.00
mean_squared_error:  5.150667009329443e-31
mean_squared_error:  5.184103150848712e-31
r2_score:  1.0
r2_score:  1.0

y_test[:10]: [1, 0, 1, 1, 1, 0, 0, 0, 0, 1]
y_train[:10]: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_test_preds_linear[:10]: [ 1.00000000e+00 -1.11022302e-16  1.00000000e+00  1.00000000e+00
  1.00000000e+00 -5.55111512e-16  4.99600361e-16 -3.33066907e-16
  1.11022302e-16  1.00000000e+00]
y_train_preds_linear[:10]: [ 1.11022302e-16 -5.55111512e-16 -6.66133815e-16  1.00000000e+00
  1.00000000e+00  1.00000000e+00  1.00000000e+00  1.00000000e+00
  1.00000000e+00  1.00000000e+00]
#LogisticRegression
model_logis = LogisticRegression()
#Fit
model_logis.fit(X_train, y_train)
#Predict model.
y_test_preds = model_logis.predict(X_test)
y_train_preds = model_logis.predict(X_train)
#Accuracy of the LogisticRegression model.
acc = accuracy_score(y_test, y_test_preds)
print('Test Accuracy: %.2f' % (acc*100))
acc = accuracy_score(y_train, y_train_preds)
print('Train Accuracy: %.2f' % (acc*100))
#mean_squared_error of model by LogisticRegression.
test_score = mean_squared_error(y_test, y_test_preds)
train_score = mean_squared_error(y_train, y_train_preds)
print('mean_squared_error: ' , test_score)
print('mean_squared_error: ' , train_score)
#r2_score of model by LogisticRegression.
test_score = r2_score(y_test, y_test_preds)
train_score = r2_score(y_train, y_train_preds)
print('r2_score: ' , test_score)
print('r2_score: ' , train_score)
#Compare data.
print('\ny_test[:10]:', list(y_test[:10]))
print('y_train[:10]:', list(y_train[:10]))
print('y_test_preds_logis[:10]:', y_test_preds[:10])
print('y_train_preds_logis[:10]', y_train_preds[:10])
#Confusion Matrix.
print('\nConfusion Matrix')
conf_matrix = confusion_matrix(y_test, y_test_preds)
print(conf_matrix)
accuracy = (conf_matrix[0][0] + conf_matrix[-1][-1]) / conf_matrix.sum()
print('\nAccuracy is', accuracy*100)
#Plot the confusion matrix (plot_confusion_matrix comes from scikit-learn, not matplotlib; it is deprecated in favor of ConfusionMatrixDisplay.from_estimator since scikit-learn 1.0).
classifier = model_logis.fit(X_train, y_train);
disp = plot_confusion_matrix(classifier, X_test, y_test, cmap=plt.cm.Blues);
print('\n', disp.confusion_matrix)
print(classification_report(y_test, y_test_preds))
Test Accuracy: 100.00
Train Accuracy: 100.00
mean_squared_error: 0.0
mean_squared_error: 0.0
r2_score: 1.0
r2_score: 1.0
y_test[:10]: [1, 0, 1, 1, 1, 0, 0, 0, 0, 1]
y_train[:10]: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_test_preds_logis[:10]: [1 0 1 1 1 0 0 0 0 1]
y_train_preds_logis[:10] [0 0 0 1 1 1 1 1 1 1]
Confusion Matrix
[[5051 0]
[ 0 4532]]
Accuracy is 100.0
[[5051 0]
[ 0 4532]]
precision recall f1-score support
0 1.00 1.00 1.00 5051
1 1.00 1.00 1.00 4532
accuracy 1.00 9583
macro avg 1.00 1.00 1.00 9583
weighted avg 1.00 1.00 1.00 9583
According to the modeling, LinearRegression and LogisticRegression both reach 100% accuracy, but they differ in MSE and in their raw predictions. LinearRegression leaves a tiny non-zero error (its predictions are floats close to, but not exactly, 0 and 1), while LogisticRegression reproduces the labels exactly, with an MSE of 0. Note, however, that the feature matrix still contains the offer id and offer_type dummy columns, which fully determine the target offer_type_code, so the perfect scores are most likely explained by this label leakage rather than by genuine predictive power.
The profile data contains unidentifiable records (missing gender and income, and implausible ages), which distort the analysis, so I removed those unclear records. Because consumers span a wide range of ages, I grouped age into ranges and predicted offer acceptance by age range, which makes the results easier to present. The mix of presentation channels has also changed over the years.
1. Portfolio:
Specifies the details of the offers presented in various formats. The "BOGO" format accounts for the most offer events; the most common channel is email (present in every offer) and the least common is social. I think the presentation channels should track how people actually use technology as closely as possible.
2. Profile:
Describes the consumer history. According to the preliminary analysis, new user registrations peaked in 2017, the youngest users are around 18, and the service is most popular with users between the ages of 50 and 60. The proportions of male and female users are similar, although in 2013 male consumers outnumbered female ones. However, some users cannot be identified: they did not register, so their gender and income are unknown.
3. Transcript:
Describes the consumers' use of the service. Transaction events amount to 82.92% of the number of offer-related events; some of these purchases may come from new customers or from customers not linked to any offer, so they cannot be attributed to Starbucks offers. Still, the existing customer base generates the larger volume, which suggests the product is in general demand. Focusing on responses to offers, 34.45% of the 167,581 offer events are views, and only about 20% are completions.
The modeling compares two models from the scikit-learn library, LinearRegression and LogisticRegression, using accuracy, MSE, and R² as indicators. The goal is to predict, from "offer completed" events, which kind of offer the consumer completed, using the year, age range, offer channels, and offer format as variables. Only "BOGO" and "DISCOUNT" offers are considered.
1. LinearRegression model: the accuracy and R² are 100%, but the MSE is not zero. Comparing the predictions with the actual labels shows the predicted values deviate slightly from reality.
2. LogisticRegression model: the accuracy and R² are 100% and the MSE is 0. The predictions match the actual labels exactly.
This means the LogisticRegression model is more appropriate for this data set than LinearRegression, even though both report 100% accuracy. However, the data used to build the model had already been heavily pruned, and the model does not include individual customer attributes. Still, this counts as a successful first modeling pass, because the results are consistent with the existing data.
The modeling tells us which kinds of "BOGO" and "DISCOUNT" offers people are interested in, broken down by age range, year, and promotion channel. I think this allows us to reach more consumers. Moreover, having information on customer needs should increase the likelihood of a second response to a promotion and bring customers back more often; that is, delivering products to individual consumers based on their preferences.
Only two models were compared here, so SVM, Naive Bayes classification, K-Nearest Neighbors (K-NN), and similar classifiers would be the most natural extensions of this analysis.
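As a sketch of that extension, the loop below fits those three classifier families on a synthetic stand-in for the one-hot feature matrix (the data and names are illustrative, not the project's real frames):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the one-hot feature matrix and binary target.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(500, 10))
y = X[:, 0] ^ X[:, 1]  # label depends on two of the binary features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit each candidate model and collect its held-out accuracy.
results = {}
for name, model in [("knn", KNeighborsClassifier()),
                    ("naive_bayes", GaussianNB()),
                    ("svm", SVC())]:
    model.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, model.predict(X_te))
print(results)
```

The same loop, pointed at the real X_train/X_test split, would give a like-for-like comparison with the two models above.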